between the first value and the k-th value within the ϑ vector,

$$
t_{\mathrm{MOST}} = \max_{\forall k}\left\{\frac{\sum_{i=1}^{k}\left(y_i - \mu_x\right)/\omega}{\sigma_{\vartheta(1:k)}}\right\}
\qquad (6.14)
$$
The MOST p value is also calculated using the permutation approach.
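To make the calculation concrete, the following Python sketch implements Equation (6.14) under stated assumptions: μ_x and ω are taken here as the median and the median absolute deviation of the control vector x (the text does not fix them at this point), and ϑ as the descending-sorted vector of standardised case expressions; the function name is illustrative only.

```python
import numpy as np

def most_statistic(x, y):
    """Sketch of the MOST statistic of Equation (6.14).

    Assumptions: mu_x and omega are the control median and MAD,
    and theta is the standardised case vector sorted in
    descending order.
    """
    x, y = np.asarray(x, float), np.asarray(y, float)
    mu_x = np.median(x)
    omega = np.median(np.abs(x - mu_x))        # MAD of the controls
    theta = np.sort((y - mu_x) / omega)[::-1]  # sorted case z-scores
    t_most = -np.inf
    for k in range(2, len(theta) + 1):         # ordered subsets theta(1:k)
        subset = theta[:k]
        sd = subset.std(ddof=1)                # sigma_theta(1:k)
        if sd > 0:
            t_most = max(t_most, subset.sum() / sd)
    return t_most
```

A permutation p value then follows by repeatedly shuffling the case and control labels and recomputing the statistic, as described above.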
LSOSS
The least sum of ordered subset square t statistic (LSOSS) algorithm is a method proposed to calculate a t statistic based on MOST [Wang and Rekaya, 2010]. The LSOSS t statistic is calculated in two steps. In the first step, an optimal cutting point is sought by examining the best division of a vector of the case expressions in terms of a standard deviation. The vector of the case expressions is sorted and then a cutting point is varied between two and the length of the vector minus two. For each potential cutting point, two standard deviations are calculated. The
sum of these two standard deviations is believed to reach a minimum at one of the varying cutting points. Suppose the optimal cutting point is denoted by k* and the optimal standard deviation sum at the optimal cutting point is denoted by σ_y*. At this optimal cutting point, an LSOSS t statistic is calculated using the following equation, where n and m are the lengths of the two vectors, namely x and y, respectively,
$$
t_{\mathrm{LSOSS}} = \frac{k^{*}\left(\mu_{y(1:k^{*})} - \mu_x\right)}{\left(\sigma_x + \sigma_y^{*}\right)/(n + m - 2)}
\qquad (6.15)
$$
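A corresponding Python sketch of the two-step LSOSS calculation is given below. It assumes, following Wang and Rekaya [2010] rather than anything fixed by Equation (6.15), that σ_x and σ_y* denote sums of squared deviations pooled over n + m − 2 degrees of freedom, with a square root applied to the pooled term; the function name is again illustrative.

```python
import numpy as np

def lsoss_statistic(x, y):
    """Sketch of the LSOSS statistic of Equation (6.15).

    Assumption: sigma_x and sigma_y* are sums of squared deviations
    (following Wang and Rekaya, 2010), pooled over n + m - 2 degrees
    of freedom and square-rooted.
    """
    x = np.asarray(x, float)
    y_sorted = np.sort(np.asarray(y, float))[::-1]  # case vector, descending
    n, m = len(x), len(y_sorted)
    # Step 1: find the cutting point minimising the ordered-subset
    # sum of squares (the "least sum" in the method's name).
    best_k, best_ss = None, np.inf
    for k in range(2, m - 1):                       # cutting points 2 .. m - 2
        high, low = y_sorted[:k], y_sorted[k:]
        ss = ((high - high.mean()) ** 2).sum() + ((low - low.mean()) ** 2).sum()
        if ss < best_ss:
            best_k, best_ss = k, ss
    # Step 2: the t statistic at the optimal cutting point k*.
    ss_x = ((x - x.mean()) ** 2).sum()              # control sum of squares
    pooled = (ss_x + best_ss) / (n + m - 2)
    return best_k * (y_sorted[:best_k].mean() - x.mean()) / np.sqrt(pooled)
```

Note that for m < 5 the search range of cutting points is empty, so a practical implementation would guard against short case vectors.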
DOG
The basic assumption of the algorithm, which is named discovering outliers based on the tight Gaussian cluster (DOG), is that the majority of the control expressions should have a small variance. Therefore, they form a tight Gaussian cluster [Yang and Yang, 2013]. It is assumed in